Lab: Canonical Correlation Analysis

Published

January 16, 2026

M1 MIDS/MFA/LOGOS

Université Paris Cité

Year 2025

Canonical Correlation Analysis

\[C(X,Y) = \mathbb{E}\left[X Y^\top\right]\]

\[\begin{bmatrix} C_{xx} & C_{xy} \\ C_{xy}^{\top} & C_{yy}\end{bmatrix}\]

The first canonical components are the solution of the following optimization problem.

Optimization problem

\[\begin{array}{lll}\text{Maximize} & & a^\top C_{xy} b \\\text{subject to} & & a^\top C_{xx} a = 1 = b^\top C_{yy} b \end{array}\]

Proposition

Let

\[U \times D \times V^\top\]

be an SVD of

\[C_{xx}^{-1/2} \times C_{xy} \times C_{yy}^{-1/2}\]

The solution to the optimization problem above is

\[a = C_{xx}^{-1/2} u_1 \qquad \text{and} \qquad b = C_{yy}^{-1/2} v_1\]

where \(u_1\) and \(v_1\) are the leading left and right singular vectors of \(C_{xx}^{-1/2} \times C_{xy} \times C_{yy}^{-1/2}\), that is the first column vectors of \(U\) and \(V\).

Proof:
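
A sketch of the argument, assuming \(C_{xx}\) and \(C_{yy}\) are invertible. The shorthand \(M\), \(\tilde a\), \(\tilde b\) and \(d_i\) (the diagonal entries of \(D\), in decreasing order) is introduced here for illustration only. For a feasible pair \((a, b)\), set

\[M = C_{xx}^{-1/2} C_{xy} C_{yy}^{-1/2}, \qquad \tilde a = C_{xx}^{1/2} a, \qquad \tilde b = C_{yy}^{1/2} b\]

The objective and the constraints then read

\[a^\top C_{xy} b = \tilde a^\top M \tilde b, \qquad \|\tilde a\|^2 = a^\top C_{xx} a = 1, \qquad \|\tilde b\|^2 = b^\top C_{yy} b = 1\]

Writing \(M = U D V^\top\) and applying the Cauchy-Schwarz inequality,

\[\tilde a^\top M \tilde b = \sum_i d_i \, (u_i^\top \tilde a)(v_i^\top \tilde b) \leq d_1 \sum_i |u_i^\top \tilde a| \, |v_i^\top \tilde b| \leq d_1 \|\tilde a\| \, \|\tilde b\| = d_1\]

The bound is attained for \(\tilde a = u_1\) and \(\tilde b = v_1\), which gives \(a = C_{xx}^{-1/2} u_1\) and \(b = C_{yy}^{-1/2} v_1\).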

Proposition

A sequence of canonical components of \(C_{xy}\) can be obtained from the sequence of (extended) left and right singular vectors of \(C_{xy}\) with respect to \(C_{xx}\) and \(C_{yy}\).

Proof:

Proposition

Let \(H_X\) (resp. \(H_Y\)) be the orthogonal projection matrix onto the linear space spanned by the columns of \(X\) (resp. \(Y\)).

The canonical correlations \(\rho_1 \geq \ldots \geq \rho_s \geq \ldots\) are the positive square roots of the eigenvalues \(\lambda_1 \geq \ldots \geq \lambda_s \geq \ldots\) of \(H_X \times H_Y\) (which are the same as those of \(H_Y \times H_X\)): \(\rho_s = \sqrt{\lambda_s}\).

Vectors \(U^1, \ldots, U^{p_1}\) are the standardized eigenvectors corresponding to the decreasing eigenvalues \(\lambda_1 \geq \ldots \geq \lambda_{p_1}\) of \(H_X \times H_Y\).

Vectors \(V^1, \ldots, V^{p_2}\) are the standardized eigenvectors corresponding to the decreasing eigenvalues \(\lambda_1 \geq \ldots \geq \lambda_{p_2}\) of \(H_Y \times H_X\).
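
The link between the projection matrices and the singular values of \(C_{xx}^{-1/2} \times C_{xy} \times C_{yy}^{-1/2}\) can be checked numerically. The sketch below is only an illustration: the simulated matrices `X` and `Y`, the seed, and the helpers `proj()` and `inv_sqrt()` are assumptions introduced here, not part of the lab material.

```r
set.seed(42)                       # arbitrary seed, for reproducibility
n <- 100; p1 <- 3; p2 <- 2

# Simulated, column-centered data sharing a common signal Z (illustrative)
Z <- matrix(rnorm(n * p2), n, p2)
X <- scale(cbind(Z + matrix(rnorm(n * p2), n, p2), rnorm(n)), scale = FALSE)
Y <- scale(Z + matrix(rnorm(n * p2), n, p2), scale = FALSE)

# Hypothetical helper: orthogonal projector onto the column space of M
proj <- function(M) M %*% solve(crossprod(M), t(M))
H_X <- proj(X)
H_Y <- proj(Y)

# Eigenvalues of H_X H_Y (real in theory; Re() drops numerical noise)
lambda <- Re(eigen(H_X %*% H_Y, only.values = TRUE)$values)
lambda <- sort(lambda, decreasing = TRUE)[seq_len(min(p1, p2))]

# Hypothetical helper: inverse square root of a symmetric positive definite matrix
inv_sqrt <- function(S) {
  e <- eigen(S, symmetric = TRUE)
  e$vectors %*% diag(1 / sqrt(e$values), nrow = length(e$values)) %*% t(e$vectors)
}

# Singular values of the whitened cross-product matrix
# (the 1/n factors cancel, so cross-products can replace covariances)
M <- inv_sqrt(crossprod(X)) %*% crossprod(X, Y) %*% inv_sqrt(crossprod(Y))

rho_proj <- sqrt(lambda)   # square roots of the eigenvalues of H_X H_Y
rho_svd  <- svd(M)$d       # singular values from the SVD characterization
all.equal(rho_proj, rho_svd, tolerance = 1e-6)
```

Both vectors should agree up to numerical error, since each one lists the canonical correlations of the simulated pair.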

Canonical Correlation Analysis (CCA) in R

cancor() from base R (package stats)

The function cancor(x, y, xcenter=TRUE, ycenter=TRUE) computes the canonical correlations between two data matrices x and y. Henceforth we assume that the columns of x and y are centered. Matrices x and y have the same number \(n\) of rows; x (resp. y) has p1 (resp. p2) columns.

The canonical correlation analysis seeks linear combinations of the y variables which are well explained by linear combinations of the x variables. The relationship is symmetric, since "well explained" is measured by correlations.

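The result is a list of five components:

  • cor correlations.
  • xcoef estimated coefficients for the x variables.
  • ycoef estimated coefficients for the y variables.
  • xcenter the values used to adjust (center) the x variables.
  • ycenter the values used to adjust (center) the y variables.

Under our centering assumption, xcenter and ycenter are zero vectors.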

The next example is taken from the documentation. Use ?LifeCycleSavings to get more information on the dataset.

Code
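library(dplyr)  # provides as_tibble(), slice_sample() and select()

# Show five rows drawn at random
LifeCycleSavings |>
  as_tibble() |>
  slice_sample(n = 5)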
# A tibble: 5 × 5
     sr pop15 pop75    dpi  ddpi
  <dbl> <dbl> <dbl>  <dbl> <dbl>
1 14.1   23.5  3.73 2631.   2.7 
2 12.6   25.1  4.7  2214.   4.52
3  9     41.3  0.96   88.9  1.54
4 12.7   44.2  1.28  400.   0.67
5  5.13  43.4  1.08  390.   2.96
Code
fm1 <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)
 
summary(fm1)

Call:
lm(formula = sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.2422 -2.6857 -0.2488  2.4280  9.7509 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 28.5660865  7.3545161   3.884 0.000334 ***
pop15       -0.4611931  0.1446422  -3.189 0.002603 ** 
pop75       -1.6914977  1.0835989  -1.561 0.125530    
dpi         -0.0003369  0.0009311  -0.362 0.719173    
ddpi         0.4096949  0.1961971   2.088 0.042471 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.803 on 45 degrees of freedom
Multiple R-squared:  0.3385,    Adjusted R-squared:  0.2797 
F-statistic: 5.756 on 4 and 45 DF,  p-value: 0.0007904
Code
pop <- LifeCycleSavings |> 
  dplyr::select(starts_with('pop'))
oec <- LifeCycleSavings |> 
  dplyr::select(-starts_with('pop'))
  
res.cca <- cancor(pop, oec)

res.cca$cor
[1] 0.8247966 0.3652762

This tells us that the highest possible linear correlation between a linear combination of pop15, pop75 and a linear combination of sr, dpi, ddpi is res.cca$cor[1]. The coefficients of the corresponding linear combinations can be found in the first columns of components xcoef and ycoef.
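
As a minimal sketch of how this reading can be checked, the block below rebuilds the canonical variates from xcoef and ycoef and compares their correlations with cor. It uses base R only; the names `res`, `pop_c`, `oec_c`, `U`, `V` and `pair_cor` are illustrative choices, not part of the lab.

```r
# Same split of the variables as above, written with base R only
pop <- as.matrix(LifeCycleSavings[, c("pop15", "pop75")])
oec <- as.matrix(LifeCycleSavings[, c("sr", "dpi", "ddpi")])
res <- cancor(pop, oec)

# Center each block with the centers reported by cancor()
pop_c <- sweep(pop, 2, res$xcenter)
oec_c <- sweep(oec, 2, res$ycenter)

# Canonical variates: one column per canonical pair
U <- pop_c %*% res$xcoef
V <- oec_c %*% res$ycoef[, seq_along(res$cor)]

# The correlation of the k-th pair of variates should match res$cor[k]
pair_cor <- diag(cor(U, V))
all.equal(unname(pair_cor), res$cor)

# cancor() scales the variates to unit sum of squares (not unit variance)
all.equal(unname(crossprod(U)), diag(ncol(U)))
```

Note that ycoef has one column per y variable, so only its first length(cor) columns pair up with the columns of xcoef.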

Question

Check that the different components of the output of cancor() satisfy all properties they should satisfy.

Question

Design a suite of tests (using testthat) that any alternative to the implementation provided by package stats should pass.

Package CCA

Abstract of CCA: An R Package to Extend Canonical Correlation Analysis

Canonical correlations analysis (CCA) is an exploratory statistical method to highlight correlations between two data sets acquired on the same experimental units. The cancor() function in R (R Development Core Team 2007) performs the core of computations but further work was required to provide the user with additional tools to facilitate the interpretation of the results.

As in PCA, CA, MCA, several kinds of graphical representations can be displayed from the results of CCA:

  1. a barplot of the squared canonical correlations (which tells us about the low rank approximations of \(H_X \times H_Y\))
  2. scatter plots for the initial variables \(X^j\) and \(Y^k\) (a kind of correlation circle)
  3. scatter plots for the individuals (rows)
  4. biplots
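
The first two kinds of display can be sketched with base graphics on the LifeCycleSavings example. This is an illustration under stated assumptions, not the plotting code of package CCA: the variables are placed by their correlations with the first two canonical variates of the x block (one common convention), and the names `res`, `U` and `circle_xy` are introduced here.

```r
pop <- as.matrix(LifeCycleSavings[, c("pop15", "pop75")])
oec <- as.matrix(LifeCycleSavings[, c("sr", "dpi", "ddpi")])
res <- cancor(pop, oec)

# 1. Barplot of the squared canonical correlations
barplot(res$cor^2, names.arg = seq_along(res$cor),
        xlab = "Dimension", ylab = "Squared canonical correlation")

# 2. Coordinates for a correlation circle: correlations between the
#    original variables and the first two canonical variates of the x block
U <- sweep(pop, 2, res$xcenter) %*% res$xcoef[, 1:2]
circle_xy <- rbind(cor(pop, U), cor(oec, U))

plot(circle_xy, xlim = c(-1, 1), ylim = c(-1, 1), asp = 1,
     xlab = "Dimension 1", ylab = "Dimension 2")
text(circle_xy, labels = rownames(circle_xy), pos = 3)
symbols(0, 0, circles = 1, inches = FALSE, add = TRUE)  # unit circle
```

Because the two canonical variates are uncorrelated, every variable falls inside the unit circle; variables close to the circle are well represented by the first two dimensions.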

Applications

Question

  1. Load nutrimouse dataset from CCA.
  2. Insert the 4 elements of list nutrimouse in the global environment (see list2env())
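
A possible way to carry out these two steps is sketched below. It assumes that package CCA is installed; the requireNamespace() guard is an addition made here so that the block does nothing otherwise.

```r
# Assumes package CCA is installed; the guard keeps the sketch harmless otherwise
if (requireNamespace("CCA", quietly = TRUE)) {
  # Load the dataset without attaching the package
  data("nutrimouse", package = "CCA")
  print(names(nutrimouse))

  # Copy each element of the list into the global environment under its own name
  list2env(nutrimouse, envir = globalenv())

  print(dim(gene))
  print(dim(lipid))
}
```

After this call the elements of nutrimouse, such as gene and lipid, can be used directly by name.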

Question

  • Compute the cross correlation matrix between gene and lipid
  • Visualize the cross correlation matrix

Question

  • Compute the canonical correlations between gene and lipid, save the result in res.cca
  • Check the canonical correlations.
  • Comment

Question

Sample 10 columns from gene and 10 columns from lipid, then repeat the operation.

Question

Draw a screeplot of the canonical correlations.

Question

Build a correlation circle.

References

https://www.jstatsoft.org/article/view/v023i12